A disaster recovery plan matters most at the moment a failure or
disaster strikes, and only if it is actually followed. A procedure or
checklist keeps all involved parties on the same page about the steps
being taken to rectify the situation. The following sections detail
steps that can be followed to ensure that no time is wasted and that
resources are not misdirected.
Qualifying the Disaster or Failure
A system failure can be reported by a number of different sources, and
every report should be verified. The reported issue might actually be
caused by user or operator error, network connectivity, or a problem
with a specific user account configuration or status. Verify a reported
failure by repeating the same steps described by the reporting party.
If the system is, in fact, in a failed state, the impact of the failure
should be noted, and this information should be escalated within the
organization so that a formal recovery plan can be created. This
process is known as qualifying the disaster or failure. A qualified
failure includes a short description of the failure, the steps used to
validate it, who is affected, how many end users are affected, which
dependent applications or systems are affected, which branch offices
are affected, and who is responsible for the maintenance and recovery
of the system.
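The items in a qualified failure can be captured in a small record so
that nothing is skipped before escalation. The following is a minimal
sketch, not a standard format: the class name, the field names, and the
sample values are all hypothetical and chosen only to mirror the list
above.

```python
from dataclasses import dataclass, field


@dataclass
class FailureQualification:
    """One qualified failure. Field names are illustrative assumptions."""
    description: str                 # short description of the failure
    validation_steps: list[str]      # steps used to validate the failure
    affected_groups: list[str]       # who is affected
    affected_user_count: int         # how many end users are affected
    dependent_systems: list[str] = field(default_factory=list)
    affected_branch_offices: list[str] = field(default_factory=list)
    recovery_owner: str = ""         # who maintains and recovers the system

    def missing_fields(self) -> list[str]:
        """Return the names of fields that are still empty."""
        return [name for name, value in vars(self).items()
                if value in ("", [], None)]


# Hypothetical report, for illustration only.
report = FailureQualification(
    description="Intranet portal returns HTTP 503",
    validation_steps=["Opened the portal from two workstations"],
    affected_groups=["Finance"],
    affected_user_count=40,
    dependent_systems=["Expense reporting"],
)
print(report.missing_fields())
# ['affected_branch_offices', 'recovery_owner']
```

A report with an empty result from missing_fields() is ready to be
escalated; anything else shows what still needs to be collected.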
Validating Priorities
When a disaster strikes that affects an entire server room or office
location, the priority of restoring systems and operations should
already be determined. First and foremost are the core infrastructure
systems, such as networking and power, followed by authentication
systems and then the remaining bare minimum core services. In the event
of a failure that involves multiple systems (for example, the failure
of a web server that supports 10 separate applications), the priority
of recovery should be presented to and approved by management. If each
of these 10 applications takes 30 minutes to recover and they are
recovered one at a time, it could be 5 hours before the system is fully
functional, but if one particular application is critical to business
operations, this application should be recovered first. Always perform
checkpoints and verification to ensure that the recovery work being
performed is in line with the priorities of the organization.
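The arithmetic behind this example is easy to check. The sketch below
assumes the applications are recovered one at a time at 30 minutes
each; the application names and priority numbers are hypothetical, with
a lower number meaning more critical.

```python
def recovery_schedule(apps, minutes_each=30):
    """Return (name, minutes_until_restored) pairs, most critical first.

    apps is a list of (name, priority) tuples. Python's sort is stable,
    so applications with equal priority keep their original order.
    """
    ordered = sorted(apps, key=lambda app: app[1])
    return [(name, (position + 1) * minutes_each)
            for position, (name, _) in enumerate(ordered)]


# Ten hypothetical applications, all with the same default priority.
apps = [(f"app{n}", 2) for n in range(1, 11)]
# Assume management has flagged app7 as critical to business operations.
apps[6] = ("app7", 1)

schedule = recovery_schedule(apps)
print(schedule[0])    # ('app7', 30)
print(schedule[-1])   # ('app10', 300), that is, 5 hours
```

Without the approved reprioritization, app7 would have been the seventh
application recovered and would have been back only after 210 minutes.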
Assume and Be Doomed
Disasters, system failures, and data corruption issues tend to create a
lot of stress and havoc among technical and business personnel.
Recovery administrators and managers should always be on the same page
regarding the priority of recovery and the process. Also, get this
communication in writing, on paper or in electronic format, because it
might be required later to justify why a choice was made.
Administrators who move forward on resolving an issue based on
assumptions, without first communicating with their managers, might
find themselves in a very sticky situation, especially if their actions
prove unsuccessful or end up causing more problems.
Synchronizing with Business Owners
Prioritizing the recovery of critical and bare minimum business systems
is part of disaster recovery planning. When a situation strikes that
requires an entire data center or group of systems to be restored or
recovered, the steps that will be followed need to be put in front of
the business owners again. Remember that between the time a disaster
recovery plan is created and the time the failure occurs, business
priorities might have shifted, and the business owners might be the
only ones aware of this change. During a recovery situation, always
take the time to stay calm and focused and to communicate with the
managers, executives, and business owners so that they are informed of
the progress. Business owners who are kept informed and feel that
recovery efforts are in good hands are less likely to linger in the
server room or data center.
Communicating with Vendors and Staff
When failures or disasters strike, communication is key. Regardless of
whether customers, vendors, employees, or executives are affected, some
level of communication is required or suggested. This is where the soft
skills of an experienced manager, sales executive, technical
consultant, and possibly even a lawyer can be most valuable. Providing
too much information, information that is too technical, or, worst of
all, incorrect or no information is a mistake technical staff
frequently make. My recommendation to technical staff is to communicate
only with your direct manager or, if that manager is not available,
with his or her boss. If the CEO or an end user asks for an update, try
to defer to the manager as best you can, so that focus can be kept on
restoring services.
Assigning Tasks and Scheduling Resources
At this point, we have a failure, we have an approved plan, we have
communicated the situation, and we are ready to begin fixing the issue.
The next step is to delegate the specific tasks to the qualified staff
members for execution. As stated previously, hand off communication to
a manager or spokesperson and communicate only through them if
possible. Determining who will restore a particular system is as
important as, if not more important than, assigning communication
responsibilities. Only certain technical staff members might be
qualified to restore a system, so selecting the correct resource is
essential.
When a serious failure has occurred, recovery efforts might require
multiple technical resources onsite for an extended period of time.
Furthermore, there might be dependencies that affect which systems can
be restored, and, of course, the order or priority of restore will
advance or delay the recovery of a system. Mapping out the extended
recovery timeline and technical resource scheduling ensures that
technical resources are not onsite before their skills and time are
required. Also, rotating technical resources after six to eight hours
helps to keep progress moving forward.
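One way to map out a restore order that respects dependencies is a
topological sort. The sketch below uses the graphlib module from the
Python standard library (Python 3.9 or later) and restores systems in
waves, ordering each wave by priority. The dependency map and the
priority numbers are hypothetical; only the ordering of networking and
power ahead of authentication comes from the discussion earlier in this
chapter.

```python
from graphlib import TopologicalSorter


def restore_order(depends_on, priority):
    """Return systems in an order that never restores a system before
    the systems it depends on. Within each wave of systems that are
    ready at the same time, a lower priority number goes first."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    order = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready(), key=lambda name: priority[name])
        for system in ready:
            order.append(system)
            sorter.done(system)  # unblocks systems that depend on this one
    return order


# Hypothetical map: each system lists the systems it depends on.
depends_on = {
    "network": set(),
    "power": set(),
    "authentication": {"network", "power"},
    "file server": {"authentication"},
    "web server": {"authentication"},
}
# Hypothetical priorities approved by management (1 is most urgent).
priority = {"power": 1, "network": 2, "authentication": 3,
            "web server": 4, "file server": 5}

print(restore_order(depends_on, priority))
# ['power', 'network', 'authentication', 'web server', 'file server']
```

The position of a system in this list also indicates roughly when the
staff member qualified to restore it needs to be onsite.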
Keeping the Troops Happy
This section goes out to
all technical leads, project managers, IT managers, business owners,
and executives. If you have technical resources working for you in an
effort to recover from a failure, you should do all you can to ensure
that these technical resources are kept happy and focused. For
starters, try to keep the end users and any other business owners or
executives from bothering this staff. Regular communication will help
with this task tremendously. Next, and possibly more important, provide
all the bottled water, soda, coffee, snacks, food, breaks, and anything
else that will keep these professionals happy, healthy, and focused on
the task at hand. Technical staff will work very hard during disaster
situations, so don’t forget to pat them on the back and let them know
how much the organization and you personally appreciate their time and
commitment.
Recovering the Infrastructure
After the failure has been validated, the initial communications
meetings have been held, restore tasks have been confirmed and possibly
reprioritized, and recovery tasks have been assigned to resources, the
recovery efforts can finally begin. Verify that each technical resource
has all the documentation, phone numbers, software, and hardware they
require to perform their task. Hold periodic checkpoint meetings,
starting every 15 minutes and tapering off to every 30 or 60 minutes as
recovery efforts continue.
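A tapering checkpoint schedule can be written down in advance so that
everyone knows when the next meeting is. In the sketch below, the start
time and the number of meetings held at each interval are assumptions;
only the 15, 30, and 60 minute intervals come from the text.

```python
from datetime import datetime, timedelta


def checkpoint_times(start, phases):
    """Return checkpoint meeting times after start.

    phases is a list of (interval_minutes, meeting_count) pairs that
    are applied in order, so later phases can use longer intervals.
    """
    times = []
    current = start
    for interval_minutes, meeting_count in phases:
        for _ in range(meeting_count):
            current += timedelta(minutes=interval_minutes)
            times.append(current)
    return times


# Hypothetical recovery start time and phase lengths.
start = datetime(2024, 1, 1, 9, 0)
times = checkpoint_times(start, [(15, 4), (30, 2), (60, 2)])
print([t.strftime("%H:%M") for t in times])
# ['09:15', '09:30', '09:45', '10:00', '10:30', '11:00', '12:00', '13:00']
```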
Postmortem Meeting
After a system failure or disaster has struck and the recovery has been
completed, an organization should hold a meeting to review the entire
process. The meeting might just be an event where individuals are
recognized for their great work; however, it will most likely involve
reviewing what went wrong and identifying how the process could be
improved in the future. A lot of interesting things happen during
disaster recovery situations, both unplanned and simulated, and this
meeting can provide the catalyst for ongoing improvement of the
processes and documentation.